In this blog post, we’ll discuss a few techniques for doing data analysis.
Once we read the dataset, we should always check for null values. The following line of code can be used for this:
df.isnull().sum(axis=0)
It gives the total number of null values for each column in the dataset.
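As a quick illustration, here is a minimal sketch on a small made-up DataFrame (the column names and values are hypothetical, used only for this example). Besides the raw counts, it also computes the share of missing values per column, which is often easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, None],
    "salary": [50000, 62000, 58000, 71000],
})

null_counts = df.isnull().sum(axis=0)  # number of nulls in each column
null_share = df.isnull().mean()        # fraction of nulls in each column

print(null_counts)
print(null_share)
```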
We can check whether there are duplicate values in the dataset using the following command:
df.duplicated().sum()
It gives us the total number of duplicated rows. We can drop the duplicates, keeping the first occurrence of each row, using the following command:
df = df.drop_duplicates()
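The keep argument of drop_duplicates() changes what gets removed, so it is worth seeing the difference on a small example. This is a sketch on made-up data (the names and scores are hypothetical).

```python
import pandas as pd

# Hypothetical toy data with one repeated row
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meera"],
    "score": [90, 85, 90, 70],
})

n_dupes = df.duplicated().sum()             # counts repeats after the first occurrence
keep_first = df.drop_duplicates()           # default keep='first': one copy survives
keep_none = df.drop_duplicates(keep=False)  # every copy of a repeated row is dropped

print(n_dupes, len(keep_first), len(keep_none))
```

Note that drop_duplicates() returns a new DataFrame, so the result has to be assigned to a variable to take effect.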
For categorical data, it is useful to look at the unique values in a particular column. The following code lists them (use nunique() instead to get only their number):
df['column_name'].unique()
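To make the difference between the values themselves and their count concrete, here is a small sketch on a hypothetical column.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue"]})

values = df["color"].unique()   # distinct values, in order of first appearance
count = df["color"].nunique()   # number of distinct values (NaN is excluded by default)

print(values)
print(count)
```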
Dealing with null values is also very important for data analysis. We can either remove the column if it has too many null values
df = df.drop(columns=['column_name'])
or we can impute them with the mean/median/mode.
df['column'] = df['column'].fillna(df['column'].mean())
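Which statistic to impute with depends on the column. As a rough sketch on made-up data (column names are hypothetical), the median is a common choice for numeric columns because it is less sensitive to outliers than the mean, while the mode is the usual option for categorical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "height": [150.0, np.nan, 170.0, 160.0],
    "grade": ["A", "B", None, "B"],
})

# Numeric column: fill with the median of the non-null values
df["height"] = df["height"].fillna(df["height"].median())

# Categorical column: fill with the most frequent value
# (mode() returns a Series, so we take its first entry)
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])

print(df)
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a selected column, avoids chained-assignment problems in recent pandas versions.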
A correlation matrix gives us information about the correlation between different features. The following commands can be used to plot it:
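import matplotlib.pyplot as plt
import seaborn as sns
# Correlation is defined only for numeric columns, so select those first
corrmat = df.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, square=True, ax=ax)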
We are using the seaborn package to plot the matrix.
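To see what the numbers behind the heatmap look like, here is a minimal sketch without any plotting. The data is made up: y rises with x, z falls with x, and label is a text column that has to be left out, since in recent pandas versions calling corr() on a DataFrame with text columns raises an error.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],    # increases in step with x
    "z": [5, 4, 3, 2, 1],     # decreases in step with x
    "label": list("abcde"),   # non-numeric, excluded below
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corrmat = df.select_dtypes(include="number").corr()

print(corrmat.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.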
We can also check the distribution of a particular column. From the plot, we can see whether the data follows a normal distribution or some other distribution. The following commands can be used:
import seaborn as sns
sns.histplot(df['column'], kde=True)  # distplot is deprecated in recent seaborn versions
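A plot is the main tool here, but a quick numeric check can complement it. The sketch below uses pandas' skew() on two made-up columns: a value near 0 suggests a roughly symmetric distribution, while a clearly positive value suggests a long right tail. This is only a rough indicator, not a test for normality.

```python
import pandas as pd

# Hypothetical toy data: one symmetric column, one with a long right tail
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tailed": [1, 1, 1, 2, 10],
})

skew_symmetric = df["symmetric"].skew()
skew_right = df["right_tailed"].skew()

print(skew_symmetric, skew_right)
```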
We can check the data types of the columns in the data. This helps us confirm whether each column has the correct data type.
df.dtypes
Many times we get dates in a column but not in datetime format. We can convert them into the correct format by using the following code:
df["Date.of.Birth"] = df['Date.of.Birth'].astype('datetime64[ns]')
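Real date columns often contain a few entries that cannot be parsed. As an alternative sketch (the sample values are made up), pd.to_datetime with errors="coerce" converts what it can and turns the rest into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data with one unparseable entry
df = pd.DataFrame({"Date.of.Birth": ["1990-05-17", "1985-12-01", "not a date"]})

# errors="coerce" replaces unparseable entries with NaT
df["Date.of.Birth"] = pd.to_datetime(df["Date.of.Birth"], errors="coerce")

print(df.dtypes)
print(df["Date.of.Birth"].dt.year)  # the .dt accessor works once the column is datetime
```

Counting the NaT values afterwards with isna().sum() shows how many entries failed to convert.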
We can check the frequency of different categories in a categorical column:
df['column'].value_counts()
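Here is a small sketch on a hypothetical column. Passing normalize=True returns relative frequencies instead of counts, which is handy for spotting imbalanced categories.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"fruit": ["apple", "banana", "apple", "cherry", "apple", "banana"]})

counts = df["fruit"].value_counts()                # absolute frequencies, most common first
shares = df["fruit"].value_counts(normalize=True)  # relative frequencies (sum to 1)

print(counts)
print(shares)
```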
We can find the maximum and minimum values in a particular column using the max() and min() methods:
df['column'].max()
df['column'].min()
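As a short sketch on made-up prices, the two values can be read individually, or together with other summary statistics through describe().

```python
import pandas as pd

# Hypothetical numeric column
df = pd.DataFrame({"price": [120, 80, 200, 150]})

lowest = df["price"].min()
highest = df["price"].max()
summary = df["price"].describe()  # count, mean, std, min, quartiles and max

print(lowest, highest)
print(summary)
```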
You can check out my sample notebook with all these commands in action with a dataset.
That’s it for now. Next time we’ll learn some more tips and tricks for data analysis. :)